fix: add per-domain RequestThrottler for 429 backoff #1762
MrAliHasan wants to merge 3 commits into apify:master
Conversation
Add a new `RequestThrottler` component that handles HTTP 429 (Too Many Requests) responses on a per-domain basis, preventing the autoscaling death spiral where 429s cause concurrency to increase.

Key features:

- Per-domain tracking: rate limiting on domain A doesn't affect domain B
- Exponential backoff: 2s -> 4s -> 8s -> ... capped at 60s
- Retry-After header support (both seconds and HTTP-date formats)
- Throttled requests are reclaimed to the queue, not dropped
- Backoff resets on successful requests to that domain

The `AutoscaledPool` is completely untouched - throttling happens transparently in `BasicCrawler.__run_task_function` before processing.

Integration points:

- `BasicCrawler`: throttle check, 429 recording, success reset
- `AbstractHttpCrawler`: passes URL + Retry-After to detection
- `PlaywrightCrawler`: passes URL + Retry-After to detection

Closes apify#1437
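To make the backoff schedule concrete, here is a minimal sketch of the delay computation described above; the helper name and constants are illustrative, not the PR's actual code:

```python
# Illustrative sketch of the described backoff schedule, not code from this PR.
from datetime import timedelta

BASE_DELAY = timedelta(seconds=2)
MAX_DELAY = timedelta(seconds=60)


def backoff_delay(consecutive_429s: int) -> timedelta:
    """Exponential backoff: 2s -> 4s -> 8s -> ..., capped at 60s."""
    exponent = max(consecutive_429s - 1, 0)
    return min(BASE_DELAY * 2**exponent, MAX_DELAY)


assert backoff_delay(1) == timedelta(seconds=2)
assert backoff_delay(2) == timedelta(seconds=4)
assert backoff_delay(6) == timedelta(seconds=60)  # 64s would exceed the cap
```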
Hi @MrAliHasan, thanks for your contribution! We'll try to review this soon.
As mentioned in #1762 (comment), the approach of reclaiming throttled requests is not optimal.
On top of that, the solution to #1437 should probably be extensible enough to also cover #1396 without much tweaking.
I believe that such a solution could be implemented in crawlee-python quite easily. See the similar issue for crawlee-js. The Python version already supports multiple "unnamed queues" via `RequestQueue.open(alias="...")`, so you'd only need to implement a `ThrottlingRequestManager` (an implementation of the `RequestManager` interface) that keeps track of the per-domain queues and their delays.
Do you want to try it?
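For illustration, a minimal sketch of the shape this proposal could take, assuming `RequestQueue.open(alias=...)` as described above; the class and every method name beyond the `RequestManager` surface are hypothetical:

```python
# Illustrative sketch of the proposed design, not the PR's actual code.
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

from crawlee import Request
from crawlee.storages import RequestQueue


class ThrottlingRequestManager:
    """Routes requests into per-domain sub-queues and skips delayed domains."""

    def __init__(self) -> None:
        self._queues: dict[str, RequestQueue] = {}
        self._delayed_until: dict[str, datetime] = {}

    async def _queue_for(self, domain: str) -> RequestQueue:
        # One unnamed queue per domain, via the alias support mentioned above.
        if domain not in self._queues:
            self._queues[domain] = await RequestQueue.open(alias=f'throttled-{domain}')
        return self._queues[domain]

    async def add_request(self, request: Request) -> None:
        domain = urlparse(request.url).hostname or ''
        await (await self._queue_for(domain)).add_request(request)

    def record_delay(self, domain: str, delay: timedelta) -> None:
        # Called on a 429 (or a robots.txt crawl-delay) for the domain.
        self._delayed_until[domain] = datetime.now(timezone.utc) + delay

    async def fetch_next_request(self) -> Request | None:
        # Serve only domains whose delay, if any, has already expired.
        now = datetime.now(timezone.utc)
        for domain, queue in self._queues.items():
            if self._delayed_until.get(domain, now) <= now:
                if (request := await queue.fetch_next_request()) is not None:
                    return request
        return None
```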
```python
@staticmethod
def _parse_retry_after_header(value: str | None) -> timedelta | None:
```
This has no business being in BasicCrawler. Better put it in the _utils module.
Moved to crawlee._utils.http.parse_retry_after_header in the refactor commit.
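The moved helper itself isn't quoted in this thread; a minimal sketch of a parser covering both accepted forms (delta-seconds and HTTP-date) could look like this:

```python
# Illustrative sketch, not the PR's actual implementation.
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after_header(value: str | None) -> timedelta | None:
    """Parse a Retry-After value given either as seconds or as an HTTP-date."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():  # e.g. 'Retry-After: 120'
        return timedelta(seconds=int(value))
    try:  # e.g. 'Retry-After: Wed, 21 Oct 2025 07:28:00 GMT'
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    return max(retry_at - datetime.now(timezone.utc), timedelta(0))
```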
```python
# Check if this domain is currently rate-limited (429 backoff).
if self._request_throttler.is_throttled(request.url):
    self._logger.debug(
        f'Request to {request.url} delayed - domain is rate-limited '
        f'(retry in {self._request_throttler.get_delay(request.url).total_seconds():.1f}s)'
    )
    await request_manager.reclaim_request(request)
    return
```
If, at some point, the request queue contains only requests from a throttled domain, this will become a busy wait with extra steps. If you're using the Apify platform, this will cost a lot in request queue writes.
I'm afraid that this means we cannot accept the PR in the current state. See the main review comments for possible next steps.
Fully addressed in the refactor. The reclaim-based throttle block was removed. ThrottlingRequestManager.fetch_next_request() now handles scheduling and awaits asyncio.sleep() when all domains are throttled, eliminating busy-wait and extra queue writes.
Thanks for the detailed review. That makes sense regarding the busy-wait behavior and queue writes.
Move per-domain throttling from the execution layer (BasicCrawler.__run_task_function) to the scheduling layer (ThrottlingRequestManager.fetch_next_request).

- ThrottlingRequestManager wraps RequestQueue and implements the RequestManager interface
- fetch_next_request() buffers throttled requests and asyncio.sleep()s when all domains are throttled, eliminating the busy-wait and unnecessary queue writes
- Unified delay mechanism supports both HTTP 429 backoff and robots.txt crawl-delay (apify#1396)
- parse_retry_after_header moved to crawlee._utils.http
- 23 new tests covering throttling, scheduling, delegation, and crawl-delay

Addresses apify#1437, apify#1396
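The key scheduling trick from this commit, waiting out the shortest remaining delay instead of reclaiming requests in a loop, can be sketched as a standalone helper; the function and dictionary names are illustrative:

```python
# Illustrative sketch of the commit's "sleep instead of reclaim" behavior.
import asyncio
from datetime import datetime, timezone


async def sleep_until_earliest_expiry(delayed_until: dict[str, datetime]) -> None:
    """When every known domain is throttled, sleep until the soonest delay expires.

    This replaces the reclaim loop: no busy-waiting and no extra request
    queue writes while all domains are backed off.
    """
    if not delayed_until:
        return
    earliest = min(delayed_until.values())
    remaining = (earliest - datetime.now(timezone.utc)).total_seconds()
    if remaining > 0:
        await asyncio.sleep(remaining)
```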
```python
if self._throttling_manager:
    self._throttling_manager.record_success(request.url)
```
I'd rather use isinstance so that an explicitly configured ThrottlingRequestManager also works.
Fixed! Updated the check to isinstance(manager, ThrottlingRequestManager).
```python
Args:
    session: The session used for the request. If None, no check is performed.
    status_code: The HTTP status code to check.
    url: The request URL, used for per-domain rate limit tracking.
```
Perhaps the parameter should be renamed to request_url so that it's not ambiguous with a URL after redirects.
Done. Renamed to request_url across BasicCrawler, AbstractHttpCrawler, and PlaywrightCrawler.
| """Raise an exception if the given status code indicates the session is blocked. | ||
|
|
||
| If the status code is 429 (Too Many Requests), the domain is recorded as | ||
| rate-limited in the `RequestThrottler` for per-domain backoff. |
```python
# NOTE: _parse_retry_after_header has been moved to crawlee._utils.http.parse_retry_after_header
```

Suggested change:

```diff
- # NOTE: _parse_retry_after_header has been moved to crawlee._utils.http.parse_retry_after_header
```
```python
if not robots_txt_file:
    return True

# Wire robots.txt crawl-delay into ThrottlingRequestManager (#1396).
```
Suggested change:

```diff
- # Wire robots.txt crawl-delay into ThrottlingRequestManager (#1396).
+ # Wire robots.txt crawl-delay into ThrottlingRequestManager
```
```python
    configuration=self._service_locator.get_configuration(),
)
self._throttling_manager = ThrottlingRequestManager(inner)
self._request_manager = self._throttling_manager
```
I'm not sure if we should use ThrottlingRequestManager by default - thoughts, @vdusek, @Pijukatel, @Mantisus?
I think it makes sense to enable it by default. When the crawler runs without a proxy, retrying on 429 only increases the load on the site, which benefits neither the site nor the crawler.
Using a proxy already requires a little configuration, so I don't think an additional parameter to disable 429 throttling would complicate that.
Please make the tests conform to the existing test structure. Perhaps @vdusek could provide further guidance here?
Good call! I've removed all the Test... classes and rewritten the entirety of test_throttling_request_manager.py to use standard top-level async def test_...() functions with pytest fixtures to match test_request_list.py.
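For reference, the flat test style in question looks roughly like this; the test body is illustrative, `parse_retry_after_header` is the helper this PR adds, and the layout simply mirrors top-level test functions with pytest parametrization rather than `Test...` classes:

```python
# Illustrative example of the flat test layout, not a test from this PR.
from __future__ import annotations

from datetime import timedelta

import pytest

from crawlee._utils.http import parse_retry_after_header


@pytest.mark.parametrize(
    ('value', 'expected'),
    [('120', timedelta(seconds=120)), (None, None)],
)
def test_parse_retry_after_header(value: str | None, expected: timedelta | None) -> None:
    assert parse_retry_after_header(value) == expected
```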
| """A request manager wrapper that enforces per-domain delays. | ||
|
|
||
| Handles both HTTP 429 backoff and robots.txt crawl-delay at the scheduling layer, | ||
| eliminating the busy-wait problem described in https://github.com/apify/crawlee-python/issues/1437. | ||
|
|
||
| Also addresses https://github.com/apify/crawlee-python/issues/1396 by providing a unified | ||
| delay mechanism for crawl-delay directives. | ||
| """ |
No need to mention the GitHub issues here if they will be resolved by this PR.
| """ | ||
| self._inner = inner | ||
| self._domain_states: dict[str, _DomainState] = {} | ||
| self._buffered_requests: list[Request] = [] |
This is a fundamental deviation from what I described - instead of storing the requests in memory, we should create a special request queue in memory (see #1762 (review)) for each domain. The ThrottlingRequestManager will then delegate fetch_next_request calls to sub-queues that are not currently delayed.
You're absolutely right. This has been fully refactored. ThrottlingRequestManager now dynamically instantiates RequestQueue.open(alias=f"throttled-{domain}") for throttled domains. Requests are routed to these separate sub-queues to avoid memory pressure, and fetch_next_request pulls from them seamlessly when domain delays expire. State operations (like is_empty and get_handled_count) now aggregate cleanly across the inner queue and all sub-queues. This aligns with Crawlee's distributed design and ensures request persistence.
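A sketch of what that aggregation might look like, written as free functions over the wrapped queues; the real logic lives as methods on `ThrottlingRequestManager`, and the names here are illustrative:

```python
# Illustrative sketch of the aggregation described above, not the PR's code.
import asyncio

from crawlee.storages import RequestQueue


async def aggregated_is_empty(inner: RequestQueue, sub_queues: dict[str, RequestQueue]) -> bool:
    """Empty only if the inner queue and every per-domain sub-queue are empty."""
    states = await asyncio.gather(inner.is_empty(), *(q.is_empty() for q in sub_queues.values()))
    return all(states)


async def aggregated_handled_count(inner: RequestQueue, sub_queues: dict[str, RequestQueue]) -> int:
    """Handled counts are additive across the inner queue and all sub-queues."""
    counts = await asyncio.gather(
        inner.get_handled_count(),
        *(q.get_handled_count() for q in sub_queues.values()),
    )
    return sum(counts)
```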
…queues and update its integration across crawlers.
Heads up @janbuchar @vdusek @Mantisus: I've pushed a significant refactor based on the latest feedback.

- Sub-queues over memory buffer: ThrottlingRequestManager now delegates to persistent per-domain sub-queues via RequestQueue.open(alias=f"throttled-{domain}") instead of keeping throttled requests in memory.
- Test structure: Completely rewrote test_throttling_request_manager.py to drop the Test... classes and conform to Crawlee's standard test structure.
- BasicCrawler fixes: Addressed all inline nits (used isinstance(), renamed url to request_url in _raise_for_session_blocked_status_code, updated docstrings/comments).

The tests track the routing origin and safely aggregate get_handled_count and is_empty metrics across the main queue and sub-queues. All 24 tests pass, and Ruff and Pytest issues have been resolved. Let me know if the updated delegation architecture feels right!
Fixes #1437
Problem
When target websites return HTTP 429 (Too Many Requests), the `AutoscaledPool` scales UP instead of down — creating a "death spiral." This happens because:

- `SessionError` → session retires → request retried
- `is_system_idle` returns `True`
- `_autoscale()` sees idle CPU → increases concurrency

The existing `_snapshot_client` only tracks Apify storage API rate limits, not target website 429s.

Solution
Following @Pijukatel's suggestion, I created a dedicated `RequestThrottler` component that handles 429 backoff per domain — the `AutoscaledPool` is completely untouched.

Key features:

- Per-domain tracking: rate limiting on `example.com` doesn't affect `other-site.com`
- Exponential backoff: `2s → 4s → 8s → ...` capped at `60s`

How it works
- `BasicCrawler.__run_task_function` checks `RequestThrottler.is_throttled(url)` before processing
- On a 429 in `_raise_for_session_blocked_status_code`, the domain is recorded
- On success (`RequestState.DONE`), the backoff counter resets

Files changed
- `src/crawlee/_request_throttler.py`
- `src/crawlee/crawlers/_basic/_basic_crawler.py`
- `src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py`
- `src/crawlee/crawlers/_playwright/_playwright_crawler.py`
- `tests/unit/test_request_throttler.py`

Tests
Future work
This is a focused first step toward the full `RequestAnalyzer` that @Pijukatel outlined (with robots.txt integration, URL group management, etc.).